36 


IEEE  TRANSACTIONS  ON  IMAGE  PROCESSING,  VOL.  18,  NO.  1,  JANUARY  2009 


Generalizing  the  Nonlocal-Means  to 
Super-Resolution  Reconstruction 

Matan  Protter,  Michael  Elad,  Senior  Member,  IEEE ,  Hiroyuki  Takeda,  Student  Member,  IEEE ,  and 

Peyman  Milanfar,  Senior  Member,  IEEE 


Abstract — Super-resolution  reconstruction  proposes  a  fusion 
of  several  low-quality  images  into  one  higher  quality  result  with 
better  optical  resolution.  Classic  super-resolution  techniques 
strongly  rely  on  the  availability  of  accurate  motion  estimation 
for  this  fusion  task.  When  the  motion  is  estimated  inaccurately, 
as  often  happens  for  nonglobal  motion  fields,  annoying  artifacts 
appear  in  the  super-resolved  outcome.  Encouraged  by  recent  de¬ 
velopments  on  the  video  denoising  problem,  where  state-of-the-art 
algorithms  are  formed  with  no  explicit  motion  estimation,  we 
seek  a  super-resolution  algorithm  of  similar  nature  that  will  allow 
processing  sequences  with  general  motion  patterns.  In  this  paper, 
we  base  our  solution  on  the  Nonlocal-Means  (NLM)  algorithm. 
We  show  how  this  denoising  method  is  generalized  to  become 
a  relatively  simple  super-resolution  algorithm  with  no  explicit 
motion  estimation.  Results  on  several  test  movies  show  that  the 
proposed  method  is  very  successful  in  providing  super-resolution 
on  general  sequences. 

Index  Terms — Nonlocal-means,  probabilistic  motion  estimation, 
super-resolution. 


I.  Introduction 

SUPER-RESOLUTION  reconstruction  proposes  a  fusion  of 
several  low  quality  images  into  one  higher  quality  result 
with  better  optical  resolution.  This  is  an  Inverse  Problem  that 
combines  denoising,  deblurring,  and  scaling-up  tasks,  aiming  to 
recover  a  high  quality  signal  from  degraded  versions  of  it.  Fig.  1 
presents  the  process  that  explains  how  a  low-resolution  image 
sequence  y*  is  related  to  an  original  higher  resolution  movie 
x*.  During  the  imaging,  the  scene  may  become  blurred  due  to 
atmospheric,  lens,  or  sensors’  effects.  The  blur  is  denoted  by 
H,  assumed  for  simplicity  to  be  linear  space  and  time  invariant. 
Similarly,  the  loss  of  spatial  resolution  due  to  the  sensor  array 
sampling  is  modeled  by  the  fixed  decimation  operator  D,  rep¬ 
resenting  the  resolution  factor  s  between  the  original  sequence, 
and  the  measured  one.  White  Gaussian  iid  noise  is  assumed  to 
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Fig.  1.  Imaging  process  to  be  reversed  by  super  resolution. 


be  added  to  the  measurements,  both  in  order  to  refer  to  actual 
noise  in  imaging  systems,  as  well  as  for  accommodating  model 
mismatches. 

The  super  resolution  goal  is  the  recovery  of  x*  from  the  input 
set  of  images  y^,  reversing  the  above  process.  Such  reconstruc¬ 
tion  relies  on  motion  in  the  scene  to  recover  details  that  are 
finer  than  the  sampling  grid.  Fig.  2  demonstrates  how  small  de¬ 
tails  can  be  recovered  when  the  motion  between  the  images  in 
the  sequence  is  known  with  a  high  degree  of  accuracy.  The  top 
row  in  the  figure  is  the  input  sequence.  The  middle  row  is  the 
up-scaled  version  of  each  image  (unknown  values  are  set  to  a 
background  color),  shifted  by  the  known  translation  between 
the  current  image  and  the  first  (reference)  image.  The  bottom 
row  shows  the  construction  of  the  super  resolution  image,  from 
left  to  right.  Initially,  the  first  image  is  placed  on  the  grid.  Then, 
every  new  image  in  the  sequence  is  placed  on  the  same  grid,  with 
a  displacement  reflecting  the  motion  it  underwent.  The  merger 
of  all  images  represents  the  outcome  of  the  super  resolution  al¬ 
gorithm.  We  note  that  this  description  of  the  mechanics  of  super¬ 
resolution  is  somewhat  simplistic;  In  most  cases,  one  cannot  as¬ 
sume  the  translations  to  be  exact  multiples  of  the  high-resolu- 
tion  pixel  sizes.  This  makes  the  estimation  of  accurate  motion 
parameters  and  the  merger  of  all  images  much  more  complex 
than  described  here. 

While  the  above-described  method  is  somewhat  simplistic,  it 
is  a  faithful  description  of  the  foundations  for  all  classic  super 
resolution  algorithms.  The  first  step  of  such  algorithms  is  an 
estimation  of  the  motion  in  the  sequence,  followed  by  a  fusion 
of  the  inputs  according  to  these  motion  vectors.  A  wide  variety 
of  super  resolution  algorithms  have  been  developed  in  the  past 
two  decades;  we  refer  to  [1]— [26]  as  representatives  of  this  vast 
literature. 

In  the  currently  available  super-resolution  algorithms,  only 
global  motion  estimation  (e.g.,  translation  or  affine  global 
warp)  is  accurate  enough  to  lead  to  a  successful  reconstruction 
of  a  super-resolved  image.  This  is  very  limiting,  as  most  actual 
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Fig.  2.  Super  resolving  an  image  using  low  resolution  inputs  with  known  trans¬ 
lations.  Reconstruction  proceeds  from  left  to  right.  Top:  input  images;  middle: 
corresponding  up-scaled  images,  shifted  using  known  translations;  bottom:  ac¬ 
cumulated  reconstruction  by  adding  the  current  low  resolution  image  to  the 
output  canvas. 


scenes  contain  motion  that  is  local  in  its  nature  (e.g.,  a  person 
talking).  Obtaining  highly  accurate  local  motion  estimation, 
known  as  optical  flow,  is  a  very  difficult  task,  and  particularly 
so  in  the  presence  of  aliasing  and  noise.  When  inaccurately 
estimated  motion  is  used  within  one  of  the  existing  recon¬ 
struction  algorithms,  it  often  leads  to  disturbing  artifacts  that 
cause  the  output  to  be  inferior,  even  when  compared  to  the 
given  measurements.  This  discussion  leads  to  the  commonly 
agreed  unavoidable  conclusion  that  general  content  movies 
are  not  likely  to  be  handled  well  by  classical  super-resolution 
techniques. 

This  severe  restriction  leads  us  to  seek  a  different  approach  to 
super-resolution.  Can  such  an  algorithm  be  proposed  with  no  ex¬ 
plicit  motion  estimation?  Our  starting  point  for  this  quest  (after  a 
super-resolution  algorithm  that  is  able  to  process  sequences  with 
a  general  motion  pattern)  is  the  video  denoising  application, 
where  several  recent  contributions  demonstrate  state-of-the-art 
results  with  algorithms  that  avoid  motion  estimation  [27]— [30]. 
Among  these,  we  choose  to  take  a  closer  look  at  the  Nonlocal 
Means  (NLM)  algorithm,  with  the  aim  to  generalize  it  to  per¬ 
form  super-resolution  reconstruction. 

The  NLM  is  the  weakest  among  the  recent  motion-estima- 
tion-free  video  denoising  algorithms,  and  yet,  it  is  also  the  sim¬ 
plest.  As  such,  it  stands  as  a  good  candidate  for  generaliza¬ 
tion.  The  NLM  is  posed  originally  in  [31]  as  a  single  image 
denoising  method,  generalizing  the  well-known  bilateral  filter 
[32],  [33].  Denoising  is  obtained  by  replacing  every  pixel  with 
a  weighted  average  of  its  neighborhood.  The  weights  for  this 
computation  are  evaluated  by  using  block-matching  fit  between 
image  patches  centered  around  the  center  pixel  to  be  filtered,  and 
the  neighbor  pixels  to  be  averaged.  Recent  work  has  shown  how 
this  method  can  be  used  for  video  denoising  by  extending  the 
very  same  technique  to  3-D  neighborhoods  [27].  An  improve¬ 
ment  of  this  technique,  considering  varying  size  neighborhoods 
is  suggested  in  [28],  so  as  to  trade  bias  versus  variance  in  an  at¬ 
tempt  to  get  the  best  mean-squared-error  (MSE). 

The  NLM  was  proposed  intuitively  in  [27]  and  [31],  and, 
thus,  it  is  natural  to  try  to  extend  it  to  perform  super-resolu¬ 
tion  using  a  similar  intuition.  This  intuition  leads  to  independent 
up-scaling  of  each  image  in  the  sequence  using  a  smart  interpo¬ 
lation  method,  followed  by  NLM  processing.  However,  exten¬ 
sive  experiments  indicate  that  this  intuitive  method  does  not  pro¬ 
vide  super-resolution  results.  For  this  reason,  a  more  profound 
understanding  of  the  NLM  filter  is  required  for  its  successful 
generalization  to  super-resolution. 

In  order  to  gain  a  better  understanding  of  the  NLM,  we  pro¬ 
pose  redefining  it  as  an  energy  minimization  task.  We  show  that 


the  novel  penalty  term  we  propose  indeed  leads  when  minimized 
to  the  NLM.  We  then  carefully  extend  the  penalty  function  to 
the  super-resolution  problem.  We  show  how  a  tractable  algo¬ 
rithm  emerges  from  the  minimization  of  this  penalty  function, 
leading  to  a  local,  patch-based,  super-resolution  process  with  no 
explicit  motion  estimation.  Empirical  tests  of  the  derived  algo¬ 
rithm  on  actual  sequences  with  general  motion  patterns  are  then 
presented,  thus  demonstrating  the  capabilities  of  the  derived  al¬ 
gorithm. 

The  structure  of  the  paper  is  as  follows.  Section  II  describes 
the  NLM  denoising  filter,  as  posed  in  [31].  This  section  can  be 
skipped  by  readers  who  are  familiar  with  the  NLM.  Section  III 
introduces  an  energy  function  to  be  minimized  for  getting  a  de¬ 
noising  effect  for  a  single  image;  We  show  that  this  minimiza¬ 
tion  leads  to  a  family  of  image  denoising  algorithms,  NLM  in¬ 
cluded  as  a  special  case.  We  also  provide  a  simpler  penalty  func¬ 
tion  addressing  the  same  goal,  which  will  be  effectively  used 
in  the  later  part  of  the  paper.  Section  IV  proposes  a  general¬ 
ization  of  the  introduced  energy  function  to  cope  with  resolu¬ 
tion  changes,  thereby  enabling  super-resolution  reconstruction. 
In  this  section  we  also  derive  the  eventual  super-resolution  al¬ 
gorithm  we  propose,  and  discuss  its  numerical  structure.  Sec¬ 
tion  V  shows  results  on  sequences  with  general  motion,  demon¬ 
strating  the  successful  recovery  of  high  frequencies.  We  con¬ 
clude  in  Section  VI,  outlining  the  key  contribution  of  this  work, 
and  describing  several  directions  for  further  research. 

II.  Bilateral  and  the  NLM  Denoising  Filters 

We  begin  our  journey  with  a  description  of  the  bilateral  and 
the  NLM  filters,  as  the  development  that  follows  relies  on  their 
structure.  The  description  given  in  this  section  is  faithful  to  the 
one  found  in  [31]  and  [32].  The  bilateral  and  the  NLM  filters  are 
two  very  successful  image  denoising  filters.  While  not  the  very 
best  in  denoising  performance,  these  methods  are  very  simple  to 
understand  and  implement,  and  this  makes  them  a  good  starting 
point  for  our  needs. 

Both  the  bilateral  and  the  NLM  filters  are  based  on  the  as¬ 
sumption  that  image  content  is  likely  to  repeat  itself  within  some 
neighborhood.  Therefore,  denoising  each  pixel  is  done  by  aver¬ 
aging  all  pixels  in  its  neighborhood.  This  averaging  is  not  done 
in  a  blind  and  uniform  way,  however.  Instead,  each  of  the  pixels 
in  the  relevant  neighborhood  is  assigned  a  weight,  that  reflects 
the  probability  that  this  pixel  and  the  pixel  to  be  denoised  had 
the  same  value,  prior  to  the  additive  noise  degradation.  A  for¬ 
mula  describing  these  filters  looks  like1 

r  ,  _  T,M)wkj)v[WihMi,j] 

where  N(k,l)  stands  for  the  neighborhood  of  the  pixel  (&,/), 
and  the  term  w[k ,  is  the  weight  for  the  (i,  j)-th  neighbor 
pixel.  The  input  pixels  are  y[fc,Z],  and  the  output  result  in  that 
location  is  x[&,/]. 

The  two  filters  differ  in  the  method  by  which  the  weights 
are  computed.  The  weights  for  the  bilateral  filter  are  computed 

*As  we  shall  see  next,  in  this  framework  the  coefficients  w[k,  /,  i,j]  are  all 
restricted  to  be  positive.  This  is  a  shortcoming,  which  can  be  overcome  by  ex¬ 
tending  the  framework  to  higher  order — see  [34]. 
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based  both  on  radiometric  (gray-level)  proximity  and  geometric 
proximity  between  the  pixels,  namely 


WBL[k,l,i,j]  =  exp 


(y[M]-yM)2l 

J 

•/  (V(fc  -  «)2 + (i  -  j)2)  •  (2) 


The  function  /  takes  the  geometric  distance  into  account,  and 
as  such,  it  is  monotonically  nonincreasing.  It  may  take  many 
forms,  such  as  a  Gaussian,  a  box  function,  a  constant,  and  more. 
The  parameter  oy  controls  the  effect  of  the  grey-level  difference 
between  the  two  pixels.  This  way,  when  the  two  pixels  that  are 
markedly  different,  the  weight  is  very  small,  implying  that  this 
neighbor  is  not  to  be  trusted  in  the  averaging. 

The  radiometric  part  in  the  weights  of  the  NLM  is  computed 
slightly  differently,  by  computing  the  Euclidean  distance  be¬ 
tween  two  image  patches  centered  around  these  two  involved 
pixels.  Defining  R^i  as  an  operator  that  extracts  a  patch  of  a 
fixed  and  predetermined  size  (say  q  x  q  pixels)  from  an  image, 
the  expression  Rk,iy  (y  is  represented  as  a  vector  by  lexico¬ 
graphic  ordering)  results  with  a  vector  of  length  q 2  being  the 
extracted  patch.  Thus,  the  NLM  weights  are  given  by 


wNLM[k,l,i,j]  =  exp 


llfc./y-fe.jyllll 

2^  J 

■/(v'(fe-*)2  +  (i-J')2)-  (3) 


Obviously,  setting  R^i  to  extract  only  a  single  pixel,  the  bilat¬ 
eral  filter  emerges  as  a  special  case  of  the  NLM  algorithm. 

We  note  that  there  are  various  other  ways  to  choose  the 
weights  in  (1),  and  the  above  separable  choice  of  the  weights 
(product  of  radiometric  and  Euclidean  distance  terms)  is  only 
one  choice.  Lor  example,  the  steering  kernel  may  provide 
an  interesting  alternative,  taking  into  account  the  correlation 
between  the  pixel  positions  and  their  value  [34].  Nevertheless, 
in  this  paper  we  shall  restrict  our  choice  of  weights  to  those 
used  by  the  NLM. 


III.  NLM  via  Energy  Minimization 

Both  the  bilateral  and  the  NLM  filters  described-above  were 
presented  intuitively  as  algorithmic  formulas,  as  in  (1).  We 
claim  that  both  these  filters  can  be  derived  by  minimizing  a 
properly  defined  penalty  function.  Following  the  rationale  and 
steps  taken  in  [33]  and  [35],  we  present  such  a  penalty  function, 
and  show  how  these  algorithms  emerge  from  it.  This  will 
prove  valuable  when  taking  the  next  step  of  generalizing  these 
methods  to  a  super-resolution  reconstruction  algorithm,  as  will 
be  shown  in  Section  IV.  Sections  III-A  and  III-C  present  two 
possible  and  novel  penalty  functions  for  denoising,  and  derive 
from  both  the  NLM  algorithm  and  some  variations  of  it.  The 
readers  interested  in  the  super-resolution  portion  of  this  work 
can  start  their  reading  in  Section  III-C. 


A.  Penalty  Function 

The  penalty  function  we  start  with  reflects  two  forces:  i)  We 
desire  a  proximity  between  the  reconstructed  and  the  input  im¬ 
ages — this  is  the  classic  likelihood  term;  and  ii)  We  would  like 
each  patch  in  the  resulting  image  to  resemble  other  patches  in 
its  vicinity.  However,  we  do  not  expect  such  a  fit  for  every  pair, 
and,  thus,  we  introduce  weights  to  designate  which  of  these 
pairs  are  to  behave  alike.  Putting  these  two  forces  together  with 
proper  weighting,2  we  propose  a  maximum  a  posteriori  proba¬ 
bility  (MAP)  penalty  of  the  form 

e2(x)  =  L||x-y||I 

+1*  (4) 

(k,i)en  (i,j)eN(k,i) 

The  first  term  is  the  log-likelihood  function  for  a  white  and 
Gaussian  noise.  The  second  term  stands  for  a  prior,  representing 
the  (minus)  log  of  the  probability  of  an  image  x  to  exist.  The 
weights  in  the  above  expression,  are  assigning  a 

confidence  that  the  patches  around  x[&,Z]  and  x[i,j\  are  to  be 
close  to  each  other.  Computing  these  weights  can  be  done  in  a 
number  of  ways,  one  of  which  is  using  (3)  and  using  y  instead 
of  the  unknown  x.  In  order  to  keep  the  discussion  simple,  from 
this  point  on  we  shall  assume  that  the  weights  w[k,l,i,j]  are 
the  NLM  ones.  It  is  important  to  note  that  the  patch  extraction 
operator  R^i  used  for  computing  the  weights  (as  in  (3))  and  the 
operator  R*./  used  in  the  penalty  term  are  generally  of  different 
sizes. 

The  notation  fl  stands  for  the  support  of  the  entire  image. 
Thus,  the  second  term  sweeps  through  each  and  every  pixel 
(&,  l)  in  the  image,  and  for  each  we  require  a  proximity  to  sur¬ 
rounding  patches  in  its  neighborhood. 


B.  Derivation  of  the  NLM  Filter 

Assuming  the  weights  are  predetermined  and  considered  as 
constants,  we  can  minimize  this  penalty  term  with  respect  to  x 
by  zeroing  its  derivative 

^C^=A(x-y)  +  i  ^ 

(k,l)en  ( i,j)€N(k,l ) 

=  0.  (5) 

In  order  to  simplify  this  equation,  we  open  the  brackets 


de2(x) 

dx 


=  0  =  A(x  —  y) 

+  5  E  E  w[k,l,iyj]Rljnkjx 

(k,l)€Q  (■ i,j)eN(k,l ) 

(fe,0€fi  (i,j)€N(k,l) 


2The  reason  for  the  factor  1/4  in  the  second  term  will  be  made  clear  shortly. 
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(k,i)en(i,j)eN(k,i) 

+  2  X  X  w[k,l,i,j\B%jRitjic.  (6) 

(k,i)en  (i,j)eN(k,i) 

We  proceed  by  invoking  two  assumptions:  i)  The  neighborhood 
is  symmetric,  i.e.,  if  (i,j)  G  N(k,l),  then  (k,l)  G  N(i,j)  is 
also  true;  and  ii)  The  weights  are  symmetric,  i.e.,  w[k,l,i,j]  = 
w[i,  j,  k ,  /].  Both  these  assumptions  are  natural — typical  neigh¬ 
borhood  definitions  satisfy  the  first  condition,  and  the  weights 
of  the  NLM  satisfy  the  second  one.  Using  these  two  assump¬ 
tions,  we  get  the  following  two  equalities: 

X  X  w[k,l,iJ]RljRk,iX 

(M)efi  ( i,j)eN(k,i ) 

=  X  X 

(k,i)en(ij)€N(k,i) 

X  Y  w[k,l,iJ]RljRk,ix 

(k,l)etl  ( i,j)£N(k,l ) 

=  X  X  w[k,l,i,  j]  Rfj  Ri,j  x. 

(k,i)en(i,j)eN(k,i) 

A  formal  proof  (Theorem  1)  for  these  equalities  is  given  in  Ap¬ 
pendix  A.  Using  these,  (6)  simplifies  and  becomes 

0  =  A(x  -  y)  +  Y  X  j]Rk^Rk,ix 

(k,i)en  ( i,j)eN(k,i ) 

-  Y  X  w[k,l,i,j\RltlRijX.  (7) 

(k,i)en(i,j)eN(k,i) 

As  can  be  seen,  the  choice  1/4  in  the  original  definition  led  to  a 
simpler  final  outcome. 

While  solving  the  above  equation  directly  is  possible  in  prin¬ 
ciple,  it  requires  an  inversion  of  a  very  large  matrix.  Instead, 
we  adopt  an  iterative  approach  based  on  the  fixed-point  strategy 
[36],  [37].  Denoting  xn_1  the  outcome  of  the  previous  iteration, 
and  xn  the  desired  outcome  of  the  current  iteration,  we  rewrite 
(7)  with  assignments  of  iteration  stage  per  each  instance  of  the 
unknown  x.  The  equation  we  propose  is 

0  =  A(xn  —  y)  +  Y  X  w[k,l,iJ]RljRkyixn 

C k,i)en(i,j)eN(k,i ) 

-  X  X  w[kJ>iJ]Rk,iRi,jXn-1  (8) 

(k,i)en(i,j)eN(k,i) 

which  leads  to  the  relation 


(*+  E 

X 

\  (Men 

rmrm 


=  Ay  +  X  X  wikiliiJ]Kh'R^,jXn  1- 

(k,i)en  (i,j)eN(k,i) 


(9) 


Notice  that  the  term  j)eN(k,i)  w\h->  U  hj]  generates  a  scalar, 
being  a  function  of  ( k ,  l) — we  shall  denote  this  as  w[k,  /]. 

In  the  obtained  equation,  the  right-hand-side  (RHS)  creates  an 
image  by  manipulating  image  patches:  for  each  location  (&,  l) 
in  the  image,  we  copy  surrounding  neighboring  patches  in  loca¬ 
tions  (i,j)  to  the  center  position  (&,  /),  multiplied  by  the  weights 


w[k,  Once  built,  this  image  is  added  to  y  with  a  proper 
weight  A. 

The  matrix  multiplying  xn  on  the  left-hand-side  is  a  diagonal 
positive  definite  matrix  (see  Appendix  A).  This  matrix’s  only 
task  is  normalization  of  the  weighted  average  that  took  place 
on  the  RHS.  As  this  matrix  is  invertible,  the  new  solution  is 
obtained  by 

x"=  ( AI+  Y.  w[MRJ,iIL,i 
\  (k,i)en 

x(Ay+  X  X  w[kJ,iJ]RkjRi,jXn-1  I  •  (10) 

\  ( k,i)en(i,j)eN(k,i )  J 


When  using  the  fixed-point  method,  as  we  did  above,  every 
appearance  of  the  unknown  in  the  equation  is  assigned  with  an 
iteration  number.  Among  the  many  possible  assignments,  one 
should  seek  one  that  satisfies  two  important  conditions:  i)  The 
computation  of  xn  from  xn_1  should  be  easy;  and  ii)  The  ob¬ 
tained  iterative  formula  should  lead  to  convergence.  As  for  the 
first  requirement,  we  indeed  have  an  assignment  that  leads  to 
a  simple  iterative  step.  Convergence  of  the  above  algorithm  is 
guaranteed  if  the  overall  operator  multiplying  xn_1  is  conver¬ 
gent,  i.e., 


AH-  X  w[MR l,zRW 

(k,i)en 


x  (  X  X  \  <  1  (11) 

(k,l)€Cl(i,j)€N(k,l) 


where  p{  A}  is  the  spectral  radius  of  A.  It  is  easily  seen  that  for 
sufficiently  large  A,  this  condition  is  met.  Nevertheless,  we  do 
not  worry  about  convergence,  as  we  will  be  using  the  above  for 
one  iteration  only,  with  the  initialization  of  x°  =  y .  This  means 
that  the  output  for  the  denoising  process  is  x  =  x1,  obtained  by 

x=|aI+  Y  fi)[M]Rfc,zRM 

V  (k,i)en 


AI+  Y  X  MM,«J]rmr; 

(k,i)en  (ij)eN(k,i) 


(12) 


The  above  computation  is  done  just  as  described  above,  with  the 
obvious  substitution  of  x°  =  y.  This  process  is  quite  reminis¬ 
cent  of  the  way  NLM  and  the  bilateral  filters  operate,  and  yet  it 
is  different.  The  obtained  algorithm  is  a  more  general  and  more 
powerful  denoising  algorithm  than  NLM. 

In  order  to  see  how  the  NLM  emerges  from  this  formulation 
as  a  special  case,  we  shall  assume  further  that  the  patch  extrac¬ 
tion  operation  we  use,  Rj  j,  extracts  a  single  pixel  in  location 
This  change  means  that  Ri  jy  is  in  fact  the  single  pixel 
y [i,j].  Thus,  in  this  case  we  have 

x=|aI+  Y  ^[M]Rfc,zRM 
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Ay+  rm  Z 


(13) 


The  term  R^iX  implies  that  a  zero  image  is  generated,  and  the 
value  x  is  inserted  to  location  (&,  /).  Using  this  property  in  (12), 
we  obtain  a  pixel-wise  denoising  formula 


x[k ,  Z] 


Ay[M]  +  Y,a,j)€N(k,i)w[k’iiiJ]y[iJ] 


(14) 


i(fe,0eO(i,j)eiV(fe,0 

=  (  ^  tT,[U]R£,Rw 

(k,i)en 


x(  Z  Z  MM,*,j]Ri,iy  )•  (17) 

v(fc,0Gf2  ( i,j)EN(k,l ) 


This  formula  seems  familiar,  as  it  is  effectively  the  same  as 
the  one  in  (4),  with  a  slight  modification  due  to  the  additional 
A.  This  means  that  under  the  simplifying  assumption  we  made 
about  Rj  j,  the  first  iteration  of  the  developed  algorithm  reduces 
to  NLM.  A  natural  question  to  ask  is:  Does  the  above  formula 
still  stands  for  a  patch-based  algorithm?  The  answer  is  positive, 
as  the  weights  may  be  computed  using  patches,  as  the  NLM 
does.  Thus,  the  next  natural  question  to  ask  is:  Why  have  we 
used  a  patch  extraction  operation  in  the  definition  of  the  penalty 
in  (4)?  The  answer  to  this  is  the  generalization  that  follows  next 
for  the  super-resolution  case. 


C.  Bayesian  Versus  Proximity — An  Alternative  Path 

The  penalty  term  in  (4)  was  formed  like  a  MAP  estimator, 
having  a  likelihood  term  that  ties  the  measurements  to  the  un¬ 
known,  and  a  regularization  term,  posing  a  prior  on  the  desired 
outcome 

c2(x)=  Z||x-y||i  +  i-  Z  wlk^hj\ 

(k,l)€Q(i,j)€N(k,l) 

•||R^x-R^x|||. 


In  the  previous  section,  we  developed  a  general  denoising 
scheme  based  on  the  above  penalty,  showing  that  the  NLM  and 
the  bilateral  filters  arise  from  it  as  special  cases.  Recall  that  one 
of  the  last  steps  in  the  derivation  was  the  use  of  a  single  iteration 
and  initialization  with  x°  =  y.  This  basically  means  that  the 
prior  term  requires  pixels  in  the  output  image  x1  to  have  similar 
grey  values  to  pixels  in  the  corresponding  neighborhood  in  y, 
provided  they  have  similar  surrounding.  This  idea  can  be  put 
directly  into  the  penalty  term,  giving 


??2(x) 


\  E  E  w[k,l,i,j\  •  ||RMx  -  Rij-ylH. 

(k,l)€Q  ( i,j)€N(k,l ) 


(15) 

Targeting  now  the  minimization  of  this  penalty  term,  we  null  the 
derivative  of  this  function  with  respect  to  x,  getting  the  equation 


drj2(x) 

dx 


=  o  =  Z  Z  MM>*,i]Rfc,z 

C k,i)en(i,j)eN(k,i ) 


x(R/e,/x  —  R^jy). 


(16) 


This  time  we  do  not  need  the  fixed-point  strategy,  as  a  simple 
solution  is  easily  derived,  leading  to 


-l 


x  = 


E  E  w[Kl,iJ]RljRk,i 

(k,l)€Q(i,j)<EN(k,l) 


Notice  the  resemblance  between  this  formula  and  the  one  ob¬ 
tained  by  a  different  route  in  (12).  In  fact,  for  A  =  0,  the  two  for¬ 
mulas  are  the  same.  Furthermore,  following  the  same  assump¬ 
tions  and  steps  that  led  us  from  (12)  to  (14),  it  is  clear  that 
pixel-wise,  the  above  aligns  perfectly  with  the  NLM  filter,  as 
written  in  (1),  namely 

,r7  n  J2(i,j)€N(k,i)  y[hi\ 

XIM  =  - — - r,  ;  •  -1 - • 

2-)( i,j)eN(k,i )  wVe*  ^ 


D.  Introducing  the  Temporal  Domain 

We  shall  use  hereafter  the  penalty  function  in  (15)  to  derive 
further  algorithms,  due  to  its  simplicity,  and  the  fact  that  it  leads 
directly  to  the  NLM.  Before  we  turn  to  super-resolution,  we 
must  introduce  the  temporal  axis  into  the  penalty  function,  so 
as  to  process  a  sequence  of  images  and  not  just  a  single  one,  as 
done  so  far.  The  following  small  changes  in  (15)  lead  to  such  a 
treatment 

*7t(x)  =  \  Z  Z  Z 

(fc,Z)€ft  t€[l,...,T]  0 i,j)€N(k,l ) 

•HRfe^x-Ri^ytlll.  (18) 

The  above  expression  makes  use  of  the  input  sequence  {yt  }J=i> 
summing  over  all  these  images.  The  term  R^;J-yt  remains  a  2-D 
patch,  but  one  that  is  extracted  at  location  (i,j)  from  the  image 
at  time  t.  The  image  x  remains  a  single  image,  representing 
the  desired  output  image  to  be  created.  Let  us  assume  that  this 
image  aims  to  become  a  denoised  version  of  yto .  This  fact  is 
used  only  within  the  computation  of  the  weights  w[k,l,i,j,t], 
which  matches  a  reference  patch  Rfc,zy*0  with  the  candidate 
ones  TLijyt,  both  2-D.  The  temporal  distance,  t  —  to  can  also 
influence  the  weight  w,  just  as  spatial  distances  k  —  i  and  l  —  j 
do  in  (3). 

Assuming  that  the  patches  extracted  by  the  operator  R  are  of 
size  of  one  pixel,3  using  the  same  algebraic  steps  as  shown  in 
the  previous  sub-section  we  get  a  closed-form  formula  for  the 
computation  of  x 

£Tkn  =  ^te[i,...,T]  J2(i,j)eN(k,i)  ^ 

Z^e[i,...,T]  S(*,i)€JV(fe,z) 

This  is  a  generalization  of  the  NLM  to  handle  video  sequences, 
and  was  shown  to  be  an  effective  algorithm  in  [27]. 

Let  us  explain  this  formula  in  words,  as  a  similar  structure 
will  serve  us  in  the  next  Section  quite  similarly.  Each  output 

3We  remind  the  reader  that  the  patch  size  related  to  the  weights-computation 
is  different  than  the  one  in  the  penalty  function  itself  in  general. 
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pixel  is  computed  as  a  weighted  average  of  pixels  in  its  3-D 
neighborhood  in  the  input  sequence  y .  We  would  like  pixels  that 
originally  had  the  same  grey  level  to  have  high  weights,  and  the 
weights  are  computed  in  a  way  that  reflects  this. 

By  taking  a  slightly  different  perspective,  we  can  regard  the 
weight  w[k,l,i,j)t]  as  reflecting  the  (relative)  probability  that 
the  pixel  (&,  l)  in  the  image  to  be  denoised  has  gone  to  (or  come 
from)  the  pixel  (i,j)  in  the  image  yt.  This  is  a  very  basic  mo¬ 
tion  estimation  approach  for  each  pixel.  However,  since  for  each 
(A:,/),  there  may  be  several  nonzero  values,  this  implies  that 
each  pixel  may  be  assigned  with  several  motion  vectors,  and 
not  just  one,  as  in  classical  motion  estimation  methods.  Indeed, 
the  above  expression  suggests  that  all  motion  vectors  are  al¬ 
lowed,  and  considered  according  to  the  probability  that  they  ac¬ 
tually  took  place  based  on  the  patch  matching  they  exhibit.  This 
way,  one  could  consider  the  above  as  a  fuzzy  motion  estimation 
process,  rather  than  an  explicit  one.  Fig.  3  presents  the  concept 
of  fuzzy  motion  estimation.  A  specific  pixel  on  the  scene  at  time 
t  (the  left  image),  has  several  equally  probable  destinations  in 
the  surrounding  images  (including  the  same  image  at  time  t ). 
This  can  be  done  for  any  pixel,  some  having  many  probable  des¬ 
tinations,  and  some  only  a  few. 

An  important  benefit  to  the  above  fuzziness  is  the  ability  to 
get  a  stronger  denoising  effect.  Whereas  exact  motion  estima¬ 
tion  leads  to  a  correspondence  between  the  reference  patch  and 
a  single  match  in  every  other  image,  the  NLM  approach  may 
find  several  good  matches  due  to  spatial  (and  not  just  temporal) 
redundancy.  As  seen  in  Fig.  3,  several  very  good  matches  can  be 
found  and,  thus,  when  averaged,  lead  to  a  more  effective  noise 
suppression  due  to  the  independence  between  the  noise  in  each 
of  those  patches. 

Generalization  of  the  NLM  approach  to  video  denoising  was 
shown  to  be  very  effective  [27] — [29],  leading  to  state-of-the-art 
results,  while  leaning  strongly  on  the  fuzziness  of  the  estimated 
motion.  These  methods’  successes  encourage  us  to  seek  ways 
to  exercise  the  same  fuzzy  motion  estimation  concept  in  other 
video  processing  tasks,  where  it  is  commonly  assumed  that  ex¬ 
plicit  motion  estimation  is  a  mandatory  step.  One  such  task  is 
super-resolution.  In  the  rest  of  this  paper  we  focus  on  a  super¬ 
resolution  scheme  that  relies  on  fuzzy,  rather  than  explicit,  mo¬ 
tion  estimation,  generalizing  the  NLM  as  presented  above. 

IV.  Super  Resolution  With  No  Explicit 
Motion  Estimation 

A.  Super-Resolution  Penalty  Function 

The  main  advantage  of  the  fuzzy  motion  approach  is  its  flex¬ 
ibility,  allowing  it  to  handle  complex  scenes.  Whereas  classic 
super  resolution  methods  make  many  limiting  assumptions  re¬ 
garding  the  motion  in  the  scene,  this  is  not  the  case  with  fuzzy 
motion  estimation,  which  can  handle  local  motions,  occlusions, 
and  so  on.  Our  goal  in  this  section  is  to  exploit  this  flexibility 
to  perform  super-resolution  on  sequences  with  arbitrary  motion 
patterns,  thus  avoiding  the  global  motion  limitation. 

Developing  a  super  resolution  algorithm  will  again  begin  with 
writing  a  proper  penalty  term  and  minimizing  it.  The  input  to 
this  algorithm  is  the  set  of  low  resolution  images  yt.  For  clarity 


^  X 

lf».  ' 

w 

-  - 

t-1  t  t+1  t+2 

Fig.  3.  NLM’ s  fuzzy  motion  estimation:  A  patch  in  the  reference  image  at  time 
t  (marked  with  a  thick  line)  has  several  probable  locations  in  the  other  images, 
and  also  within  the  same  image  itself  (marked  with  narrow  lines). 

of  description,  we  target  one  high  resolution  image  X  at  a  time 
(instead  of  working  on  the  entire  sequence  at  once).  The  usage 
of  a  capital  letter  to  denote  the  output  image,  and  lower-case 
letters  to  denote  the  input  sequence,  serves  to  remind  us  that 
they  are  indeed  of  different  scales. 

Since  we  want  to  exploit  the  insight  gained  in  previous  sec¬ 
tions,  we  aim  at  defining  a  penalty  function  that  is  similar  to 
that  written  for  the  video  denoising  problem  in  (18).  However, 
the  target  image  X  and  the  input  images  y^  are  not  of  the  same 
scale,  forcing  a  change  to  the  penalty  term.  This  change  should 
reflect  the  fact  that  X  undergoes  blurring  and  decimation  steps, 
in  order  to  account  for  the  different  scales.  This  leads  us  to  the 
following  preliminary  (and  not  final!)  proposal: 

Vsr(X-)=1'  w[k,l,i,j,t\ 

;a-.0cs>«'[ . / 1 

x||RwDHX  -  Rij-ytll!  +  ATV(X).  (20) 

Introduced  into  the  penalty  term  are  two  operators  we  have 
met  in  the  introduction:  H — the  blurring  operator,  and  D — the 
decimation  operator.  Applying  these  operators  on  X  simulates 
the  imaging  process,  bringing  X  to  the  same  resolution  as  the 
input  sequence  y^.  The  additional  TV  (Total  Variation)  expres¬ 
sion  comes  to  regularize  the  deblurring  that  should  take  place  in 
this  expression  [38],  forcing  piece-wise  smoothness  of  the  de¬ 
sired  image,  by  accumulating  the  norms  of  the  gradients  with 
Li  norm. 

Taking  a  close  look  at  the  penalty  term  we  have  just  written 
reveals  that  it  cannot  provide  super  resolution  reconstruction.  By 
denoting  x  =  DHX,  it  is  clearly  seen  the  above  is  identical  to 
the  video  denoising  penalty  term  in  (18)  for  x.  This  means  that 
minimizing  this  penalty  term  effectively  divides  the  problem 
into  two  parts — the  first  performs  denoising  of  the  input  se¬ 
quence,  and  the  second  that  interpolates  and  de-blurs  the  re¬ 
sulting  image. 

The  reason  this  penalty  term  is  not  able  to  perform  super  res¬ 
olution  is  the  fact  that  the  operator  R,*.^  “sees”  only  1/s2  of 
the  pixels  in  the  image  (s  is  the  decimation  factor)  because  of 
the  decimation  that  precedes  it.  In  order  to  solve  this  problem, 
we  reverse  the  order  of  the  patch  extraction  and  decimation,  so 
that  a  patch  is  first  extracted  from  HX,  and  then  decimated  to 
the  same  resolution  level  as  y^  to  be  compared  to  a  patch  from 
it.  This  change  requires  a  change  in  the  operator  D  as  well,  as 
we  no  longer  use  the  constant  decimation  grid  of  the  image.  In¬ 
stead,  we  introduce  the  notation  Dp — a  patch  decimation  op¬ 
eration  that  ensures  that  the  center  pixel  of  the  patch  is  on  the 
decimation  grid. 
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Fig.  4.  Relation  between  the  patch  sizes  q  and  p  =  (q  —  1  )s  + 1  demonstrated 
on  specific  examples:  [g,  s ]  =  [3,1],  [3,2],  and  [3,3]. 


Fig.  5.  Description  of  the  expression  DpR^HX  —  Rf^y  t  in  (21). 


Since  there  are  two  different  scales  (and  sizes)  of  patches 
used,  we  also  must  differentiate  between  the  patch  extraction 
operators.  To  this  end,  we  denote  by  Rj^  and  R £  •  the  high-  and 
low-resolution  patch  extraction  operators  respectively.  Their 
sizes  are  linked  through  the  value  of  the  scaling  parameter  s: 
whereas  R£  ■  extracts  a  patch  of  size  q  x  q  pixels  (arbitrary), 
Rfz  extracts  a  patch  of  size  pxp  pixels,  where  p  =  s(q— 1)+1. 
The  relation  between  these  sizes  is  further  explained  in  Fig.  4. 
This  transforms  the  penalty  term  in  (20)  into 

^sfl(X)  =  EE  E 

(k,l)€Q  t€[l,...,T]  (i,j)€NL  ( k,l ) 

x  IlDpRg, HX  -  R&ytlla  +  XTVpq.  (21) 

Fig.  5  presents  a  block  diagram  to  explain  this  penalty.  As 
can  be  seen,  different  size  patches  are  extracted  from  x  and  yt, 
and  brought  to  the  same  grounds  by  a  decimation  operation. 

Note:  The  term  (i,j)  £  N(k,l)  is  no  longer  applicable  as 
before,  since  now  (i,j)  and  (k,l)  are  pixels  on  different  res¬ 
olution  grids.  Therefore,  we  introduce  a  new  notation  NL  re¬ 
ferring  to  the  equivalent  low-resolution  neighborhood  of  (k,  /), 
i.e.,  (i,j)  £  NL(k,l)  is  short  for  the  full  notation  (i,j)  s.t. 
(si,  sj)  £  N(k,l). 


The  penalty  term  we  have  defined  above  is  the  one  we  shall 
focus  on  hereafter.  Next,  we  show  how  to  use  this  penalty  term 
as  the  basis  for  a  practical  and  simple  super  resolution  algorithm. 
The  first  step  we  take  is  a  separation  of  the  deblurring  from  the 
fusion  of  the  images.  Using  the  substitution  Z  =  HX,  the  first 
problem  we  have  is  the  estimation  of  Z  from  the  measurements 
{y*}fc=i-  This  is  done  by  minimizing  the  energy  function 

?1sb(Z)=  EE  E 

xHDpRg.Z-R^.ytllj-  (22) 

Once  found,  we  use  the  found  Z  to  estimate  X  by  minimizing 
vi r(X)  =  ||Z  -  HX|||  +  ATT(X).  (23) 

The  second  part  is  a  classic  simple  deblurring  problem.  This 
is  of  course  an  under-determined  problem  (since  H  is  usually 
singular),  and  needs  regularization.  We  will  not  discuss  how  to 
solve  it  here,  see  [39]— [44]  for  a  background  view  of  this  field, 
along  with  several  state-of-the-art  deblurring  methods. 

The  idea  of  breaking  the  super-resolution  task  into  two 
parts — fusing  the  inputs  and  then  deblurring — has  been  sug¬ 
gested  previously  in  [13],  [18],  [20],  and  [21].  In  the  case 
of  pure  translational  motion,  and  when  both  problems  are 
treated  as  maximum-likelihood  (which  is  not  done  here),  such 
a  separation  is  equivalent  to  the  joint  solution.  It  is  important  to 
note  that  in  the  derivation  proposed  above,  the  separation  is  not 
optimal,  and  leads  to  inferior  results.  However,  the  separation 
allows  for  a  much  simpler  algorithm  (both  conceptually  and 
implementation-wise),  so  the  sub-optimality  is  the  price  to  be 
paid  for  this  simplicity. 

The  fusion  stage  in  (22)  seems  to  lack  regularization,  and 
as  such,  one  may  question  its  stability.  Stability  is  gained 
through  the  weights — those  should  be  computed  in  such  a 
way  that  ensures  that  every  pixel  has  at  least  a  few  “quality” 
(assigned  a  meaningful  weight)  destinations.  Furthermore,  by 
also  assigning  a  nonzero  weight  to  the  original  value  of  the 
pixel,  further  stability  is  obtained,  guaranteeing  that  in  case 
no  appropriate  patches  are  found,  the  result  will  default  to  the 
initial  estimate. 

We  shall  leave  the  deblurring  aside,  and  focus  hereafter  on 
the  fusion  of  the  measurements.  This  will  allow  us  to  simplify 
the  description  of  the  algorithm.  The  next  step  is  deriving  the 
penalty  function  in  (21)  with  respect  to  Z,  and  finding  the  zero 
intersection 

%Z1=2E  e  e 

(k,l)€Q  t€[l,...,T]  ( i,j)€NL(k,l ) 

x(DJ.R«)T(DJ,R^Z-Rf;Jy,) 

=  0.  (24) 

This  leads  to  the  solution 

z=  MM, 

_(k,l)€Q  i€[l,...,T]  (iJ)eNL  ( k,l ) 
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x  (Dpr£,)t  (Dpr£,)] 

w[k,l,ij,t] 


(25) 


which  can  be  further  simplified,  by  defining 


»Zs.r(z[M  ])=  E  H  w\k.l,i.J.i: 

x(Z[k,l\-yt[i,j]f  (29) 


leading  to  a  closed-form  solution 


MM]  =  XI  X 


leading  to 
Z  = 


n  -1 


E  ™[M](DpR£,)  (DpRg,) 
E  (°pRw)T 

(fe,0€fi 


XI  X  X  f]R^-yt 


(26) 


(27) 


The  right  term  that  involves  the  measurements  {y t}t=i  Per_ 
forms  a  series  of  simple  operations  that  include:  (i)  Extraction 
of  patches  from  the  measurements;  (ii)  Zero-padding  interpola¬ 
tion  of  these  patches  (done  by  Dj);  (iii)  An  accumulation  of  the 
resulting  patch  with  a  proper  weight  at  the  destination  location. 

Following  the  same  rationale  as  the  one  practiced  for  (9), 
the  matrix  to  be  inverted  in  the  above  expression  is  a  diagonal 
one  and  positive  semi-definite,  normalizing  the  accumulation 
in  each  pixel.  The  term  (DpR^)T(DpRj^)  extracts  a  patch 
from  location  (&,/),  scales  it  down  and  then  up  again  by  zero 
padding,  and  finally  puts  it  back  into  the  same  original  location. 
Thus,  this  is  a  diagonal  matrix  with  1-es  for  the  pixels  surviving 
this  path,  and  zeros  elsewhere.  The  matrix  to  be  inverted  is  a 
weighted  sum  of  such  diagonal  matrices,  and,  thus,  it  is  diag¬ 
onal  as  well.  If  every  pixel  (&,  Z)  gets  some  accumulation,  this 
matrix  is  strictly  positive  definite,  and  its  inversion  is  permitted. 


B.  Simplified  Numerical  Algorithm 

We  again  follow  the  path  of  Section  III,  and  propose  a  special 
and  simplified  case  where  the  patch  extraction  operator  Rf  ■ 

TT  ^ 

extracts  only  one  pixel.  This  means  that  R£j  extracts  a  patch 
of  size  s  x  s  pixels,  to  become  one  pixel  after  the  decimation 
operation  Dp.4  This  simplifies  the  penalty  function  in  (22),  since 
DpRgzZ  =  Z [k,l\  and  R^-yt  =  y t[hj],  leading  to 

Vsr(^)=  XX  X  w[k,l,i,j,t] 

x(z[M]-y*M)2.  (28) 

This  functional  is  separable,  handling  every  pixel  in  the  target 
image  Z  separately.  This  implies  that  an  independent  penalty 
can  be  written  for  every  destination  pixel 

4Note  that  this  change  of  patch-sizes  applies  only  to  the  penalty  term  itself, 
and  does  not  effect  the  patch  size  for  the  weights  computation. 


z[M]  = 


(30) 

In  order  for  this  solution  to  exist,  it  is  enough  that  for  each  pixel 
(, k ,  l)  there  exists  at  least  one  none-zero  weight  w[k ,  Z,  t]. 

Notice  that  the  above  formula  looks  exactly  the  same  as  the 
one  used  for  video  denoising,  posed  in  (19).  Are  the  two  equiv¬ 
alent?  The  answer  is  negative,  due  to  two  important  reasons. 
First,  there  is  a  gap  between  the  definitions  of  the  neighborhoods 
N(k,l)  and  NL(k,l).  Whereas  for  the  video  denoising  this 
neighborhood  is  defined  simply  as  the  set  of  all  neighbors  for 
the  central  location  (&,  l),  the  neighborhood  referred  to  in  (30) 
considers  locations  (i,j)  that  after  scaling  up  by  5  are  neighbors 
of  (fc,Z). 

The  second  difference  between  (19)  and  (30)  refers  to  the 
weights  to  be  used.  As  described  earlier,  we  want  the  weight 
w[k,l,i,j,t]  to  reflect  the  probability  that  the  pixel  Z[fc,Z]  and 
the  pixel  y t[hj]  originated  from  the  same  place.  This  compu¬ 
tation  will  be  based  on  the  similarity  of  the  area  around  both 
pixels,  in  the  same  manner  the  weights  for  the  NLM  are  com¬ 
puted  (since  they  serve  a  similar  purpose),  as  described  in  (3). 
The  function  /  that  takes  into  account  the  geometric  distance  be¬ 
tween  the  patches  is  set  to  be  constant,  thereby  giving  no  pref¬ 
erence  to  nearby  patches  over  distant  ones.  This  allows  more 
robustness  to  various  motion  patterns,  including  patterns  that 
contain  large  motions.  However,  in  some  cases,  it  may  be  ben¬ 
eficial  to  assign  higher  weights  to  patches  closer  temporally  or 
geometrically. 

However,  for  matching  these  two  areas,  we  need  to  address 
the  fact  that  the  neighborhoods  around  the  pixels  in  Z  and  in  yt 
are  not  of  the  same  scale.  We  can  address  this  in  one  of  two  ways: 
(i)  Down-sample  the  neighborhood  in  the  high-resolution  image 
and  bring  it  down  to  the  low  resolution  grid;  or  (ii)  Up-sample 
the  low-resolution  neighborhood  to  match  the  high-resolution 
one.  In  our  tests  we  have  found  that  these  two  options  perform 
similarly  well,  with  a  small  advantage  to  the  second  (and  more 
complex)  one. 

As  in  the  denoising  case,  the  computation  of  the  weights  re¬ 
quires  the  use  of  the  unknown  Z.  Instead,  the  weights  are  com¬ 
puted  by  using  an  estimated  version  of  Z  being  a  scaled-up 
version  of  the  reference  frame  we  aim  to  super-resolve.  This 
scale-up  is  done  using  a  conventional  image  interpolation  algo¬ 
rithm  such  as  bilinear,  bicubic  or  the  Lanczos  method  [45],  [46]. 

The  interpolated  images  are  only  crude  estimates  of  the  de¬ 
sired  outcome,  and,  therefore,  the  weights  computed  by  relying 
on  these  estimated  are  also  somewhat  crude.  Since  after  one  iter¬ 
ation  of  the  algorithm  we  obtain  a  better  estimate,  it  is  possible 
to  use  these  new  estimates  for  re-computing  the  weights,  and 
computing  a  better  still  estimate  of  the  desired  outcome.  This 
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TABLE  I 

Summary  of  the  Core  of  the  Super  Resolution 
Algorithm — Processing  One  Image  Using  a  Given  Initial 
Estimate  for  the  Super-Resolved  Sequence 


Objective:  Estimate  X*0  -  a  super-resolved  image  at 
time  to. 

Inputs:  •  yt  -  input  set  of  low  resolution  and  noisy 
images. 

•  s  -  the  desired  scaling  factor  (any  integer). 

•  q  -  the  size  of  the  low  resolution  patch  (RL). 

•  p  -  size  of  the  high  resolution  patch  (Rff): 
p  =  s(q-  1)  +  1. 

•  {Yt}£li  -  An  initial  estimate  of  the  super- 
resolved  sequence. 

Initialization: 

•  Set  Z t0  —  Yto. 

•  Set  V  and  W  to  be  zero  images  of  the  same  size 
as  Z*0. 

Fusion:  For  each  ( k,l )  E  O  and  each  ( i,j,t )  such  that 
(si,  sj,  t)  e  N(k,l,t0) 

1)  Compute  Weights :  w[k,l,i,j,t]  = 

exp{-\\Rk,iZto  -  Rsi,sjYt\\2/2a2}. 

2)  Accumulate  Inputs : 

•  Extract  the  low-resolution  patch  Rf^y*, 

•  upscale  it  by  zero-filling, 

•  accumulate  it  in  its  proper  location 

V  =  V  +  (n"f)rD;R,';;y,. 

3)  Accumulate  Weights :  For  each  patch  accumulated 
above,  apply  the  following  update  of  the  weight 
image 

W+  =  w[k,l,i,j,t]  (Ki)  DpR^l. 

Normalization:  Set  Zto[k,l]  =  V[k,  l]/W[k,  l]. 

Deblurring:  Minimize 

Vsr  (X)  =  \\Zt0  -  HX||!  +  XTV  (X)  w.r.t.  X, 
and  set  the  result  to  be  Xio . 

_ Result:  The  output  is  Xt„. _ 

process  may  be  iterated  several  times,  although  in  our  simula¬ 
tions  we  use  only  two  such  iterations  (i.e.,  one  iteration  that  re¬ 
lies  on  the  interpolated  frames  for  computing  the  weights,  and 
a  second  iteration  that  relies  on  the  results  of  the  first  iteration). 
This  means  that  all  frames  are  first  processed  using  the  interpo¬ 
lated  frames  for  the  weights  computation,  and  only  then  is  the 
second  iteration  applied.  If  throughput  constraints  do  not  allow 
this,  it  is  possible  to  compute  the  weights  in  the  second  itera¬ 
tion  while  still  using  the  interpolated  versions  of  all  frames,  and 
a  new  estimate  is  only  used  for  the  currently  processed  frame. 
This  requires  some  minor  modifications  to  the  weights  compu¬ 
tation. 

C.  Proposed  Algorithms — A  Summary 

We  now  summarize  the  two  algorithms  previously  described. 
First,  we  describe  in  Tables  I  and  II  the  general  super-resolution 
algorithms  that  were  developed  in  this  section.  The  simpler  ver¬ 
sion  that  uses  low-resolution  patches  of  one  pixel  is  obtained 


TABLE  II 

Summary  of  the  Entire  Super  Resolution  Algorithm — Used  for 
Resolving  All  the  Frames  in  the  Sequence 


Objective:  Estimate  |x*  |  -  the  super-resolved  se¬ 
quence.  1 

Inputs:  •  {y t}J=1  -  input  set  of  low  resolution  and 
noisy  images; 

•  s  -  the  desired  scaling  factor  (any  integer). 
Preprocessing:  Compute  {Yt}^  -  a  Lanczos  inter¬ 
polation  of  the  input  sequence. 

Iterating:  Perform  the  following  steps  per  each  iteration 

1)  Super-resolve  each  image  using  the  algorithm 
outlined  in  Table  I,  obtaining  a  new  estimate  for 

the  super-resolved  sequence  |xt|  • 

2)  Update  current  estimation  {Yt}^  =  ’ 

and  use  as  initial  estimate  for  the  following  itera¬ 
tion. 

Result:  The  output  is  |xt  J  . 

by  using  the  proper  assignments  for  and  Rf  ’  ■  (while  not 
changing  R). 

The  complexity  of  these  two  algorithms  is  essentially  the 
same  as  that  of  their  NLM  counterparts,  with  the  addition  of 
a  deblurring  process,  which  can  be  considered  negligible.  The 
core  of  the  algorithm,  which  also  requires  most  of  the  compu¬ 
tations,  is  computing  the  weights.  Considering  a  nominal  case 
with  a  search  area  of  3 1  x  31  low-resolution  pixels  in  the  spatial 
domain,  and  15  images  in  the  temporal  axis,  we  have  &  14,000 
pixels  in  this  spatio-temporal  window.  For  each  pixel  in  the 
search  area,  the  block  difference  is  computed,  with  a  block  size 
of  13  X  13  (high-res.)  pixels.  Thus,  there  is  a  total  of  almost 
2,400,000  operations  per  pixel.  This  amount  is  of  course  very 
large,  and  must  be  reduced  in  order  to  make  the  algorithm  prac¬ 
tical.  We  will  now  describe  a  few  speedup  methods  for  the  pro¬ 
posed  algorithm.  Several  of  the  methods  to  speedup  the  NLM 
algorithm  were  suggested  originally  in  [47],  and  were  adopted 
in  our  simulations. 

1)  Computing  the  weights  can  be  done  using  block  differ¬ 
ences  in  the  low-resolution  images,  instead  of  on  the  in¬ 
terpolated  images.  This  saves  a  factor  of  «  s2. 

2)  Computing  fast  estimations  for  the  similarity  between 
blocks,  such  as  the  difference  between  the  average  grey 
level  or  the  average  direction  of  the  gradient,  can  eliminate 
many  nonprobable  destinations  from  further  processing. 
Such  an  approach  was  suggested  in  [47],  and  was  found  to 
be  very  effective  for  the  original  NLM  algorithm. 

3)  If  the  patch  used  to  compute  the  weights  is  rectangular 

with  equal  weights  to  all  pixels,  the  computation  of  the 
weight  can  be  sped  up  dramatically.  Using  the  Integral 
Image  (II(x,  y )  =  J2i=i  Y^=i  the  block  differ- 

ence  can  be  computed  using  a  small  constant  number  of 
calculations  per  pixel,  regardless  of  block  size.  Using  such 
a  patch  structure  has  only  a  slight  effect  on  the  quality  of 
the  outputs  [48]. 
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The  state  of  the  art  movie  restoration  methods  like  AWA,  LMMSE 
either  estimate  motion  and  filter  out  the  trajectories,  or  compensate  the 
motion  by  an  optical  now  estimate  and  then  filter  out  the  compensated 
movie.  Now,  the  motion  estimation  problem  is  fundamentally  ill-posed. 
This  fact  is  known  as  the  aperture  problem:  trajectories  are  ambiguous 
since  they  could  coincide  with  any  promenade  in  the  space-time  isophotc 
surface.  In  this  paper,  we  try  to  show  that,  for  denoising,  the  aperture 
problem  can  be  taken  advantage  of.  Indeed,  by  the  aperture  problem, 
many  pixels  in  the  neighboring  frames  are  similar  to  the  current  pixel 
one  wishes  to  denoise.  Thus,  denoising  by  an  averaging  process  can  use 
many  more  pixels  than  just  the  ones  on  a  single  trajectory.  This  obser¬ 
vation  leads  to  use  for  movies  a  recently  introduced  denoising  method, 
the  NL-means  algorithm.  This  static  3D  algorithm  outperforms  motion 
compensated  algorithms,  as  it  does  not  lose  movie  details.  It  involves  the 
whole  movie  isophote,  including  the  current  frame,  and  not  just  a  trajec¬ 
tory.  Experimental  evidence  will  be  given  that  it  also  improves  the  \dirt 
and  sparkle"  detection  algorithms 

(a) 


Tic  itr.f  of  lie  art  racvtc  rriUraUrs  ckUc<i  like  AWA  LKMIK 
fKifi  n'.irc'.r  *«•;■«•  •«<  filet  ••<  ht  li  tr«ln<ln  «•  ittfeiulr  II# 
iuii»  by  m  cftical  Hew  ntiruii  ai>J  iliac,  flhir  ir.  at  cirfiiuiil 
rrertr.  Niw,  llie  ncifcxi  earn*  c*  jrsblnr.  Ii  fnftrsrua  >  (I  f«cf 
Tilt  fan  ii  cuftT.  ii  He  lywivw  pvaSVa:  ira;«;t<x'Mi  ui  inbifiaji 
ih;<  tl#y  rail!  talvili  wilt  uy  >rctr#iali  Ir  ii#  iy  »;*  Una  u>ir.a 
uifuf.  1ft  ilia  pif<r  wr  try  to  lid  fir  Iniltht  Uc 
pccllro  rift  U  u In  atviriafi  at  lr.li<£,  by  lie  ay«tti4;t  |f  rtliir, 
mr?  |l*r  i  in  llic  rrifTbir.rtt  frvrei  arc  Krrilk?  Ir  lie  (urrni  par 

ue  ulUetli  4ni>v  hu,  4tulilt|  «r  |itn|h|  yiuni  ««•  **# 
iratf  raara  y lull  Hat  jut  lie  ccei  tt  a  ilif.W  ireeitory  THU  ckur 
vein  Inca  to  u»r  far  trwdn  a  rccnal*  UlrKuN  denxsui  rtrlbbi 
If#  M  wm  •  viht  Tin  *!•-•  .Ill  liar  a«lpaifit  rr%  wtati 

cirviMiui  aljirtMi  m  it  tea i  r cm.  lata  nare  <1rtaia.  2  ItioIwi  :u 
•bik  imiruiylftr,  lu’ulin'.iroi'rntf.'anr,  ltd  rcl  )n:  alrajrc- 
!*•••  I  »f#t  inetlal  eallttoa  wll  b#  f  \#»  llill  II  !n;nr.ti  Of  >4nl 

u4ipirlir  irectln  il|vKkr* 

(c) 


The  stale  of  the  art  movie  restoration  methods  like  AWA,  LMMSE 
either  estimate  motion  and  filter  out  the  trajectories,  or  compensate  Use 
motion  by  an  optical  now  estimate  and  then  filter  out  the  compensated 
movie.  Now,  the  motion  estimation  problem  is  fundamentally  ill-posed. 
This  fact  Is  known  as  the  aperture  problem:  trajectories  are  ambiguous 
since  they  could  coincide  with  any  promenade  In  the  space-time  isophote 
surface.  In  this  paper,  we  try  to  show  that,  for  denoising,  the  aperture 
problem  can  be  taken  advantage  of.  Indeed,  by  the  aperture  problem, 
many  pixels  In  the  neighboring  frames  are  similar  to  the  current  pixel 
one  wishes  to  denoise.  Thus,  denoising  by  an  averaging  process  can  use 
many  more  pixels  than  just  the  ones  on  a  single  trajectory.  This  obser¬ 
vation  leads  to  use  for  movies  a  recently  introduced  denoising  method, 
the  NL-means  algorithm.  This  static  3D  algorithm  outperforms  motion 
compensated  algorithms,  as  it  does  not  lose  movie  details.  It  involves  the 
whole  movie  isophotc,  including  the  current  frame,  and  not  just  a  trajec¬ 
tory.  Experimental  evidence  will  be  given  that  it  also  improves  Uie  \dirt 
and  sparkle"  detection  algorithms 
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Tlie  state  of  the  art  movie  restoration  methods  like  AWA,  LMMSE 
either  estimate  motion  and  filter  out  Use  trajectories,  or  compensate  the 
motion  by  an  optical  flow  estimate  and  then  filter  out  the  compensated 
movie.  Now,  the  motion  estimation  problem  is  fundamentally  ill-posed. 
Tills  fact  is  known  as  the  aperture  problem:  trajectories  are  ambiguous 
since  they  could  coincide  with  any  promenade  in  the  space-time  isophote 
surface.  In  this  paper,  we  try  to  show  that,  for  denoising,  the  aperture 
problem  can  be  taken  advantage  of.  Indeed,  by  Uie  aperture  problem, 
many  pixels  in  the  neighboring  frames  are  similar  to  the  current  pixel 
one  wishes  to  denoise.  Thus,  denoising  by  an  averaging  process  can  use 
many  more  pixels  than  Just  Use  ones  on  a  single  trajectory.  Tills  obser¬ 
vation  leads  to  use  for  movies  a  recently  introduced  denoising  method, 
the  NL-means  algorithm.  This  static  3D  algorithm  outperforms  motion 
compensated  algorithms,  as  it  does  not  lose  movie  details.  It  involves  the 
whole  movie  isophote,  including  the  current  frame,  and  not  just  a  trajec¬ 
tory.  Experimental  evidence  will  be  given  that  it  also  improves  Uie  \dirt 
and  sparkle"  detection  algorithms 

(f) 


Fig.  6.  Results  for  the  synthetic  text  sequence,  (a)  Original  (ground-truth)  image,  (b)  Pixel  replicated  image,  13.47  dB.  (c)  Lanczos  interpolation,  13.84  dB. 
(d)  Deblurred  Lanczos  interpolation,  13.9  dB.  (e)  Result  of  shift-and-add  algorithm  [18],  [20],  18.4  dB.  (f)  Result  of  proposed  algorithm,  18.48  dB. 


4)  A  coarse-to-fine  approach  reduces  the  effective  search  area 
for  each  pixel,  thus  reducing  the  number  of  required  calcu¬ 
lations. 

5)  The  search  area  can  be  adapted  to  the  temporal  distance, 
making  it  small  for  nearby  images  and  growing  for  images 
further  away,  also  reducing  the  total  effective  search  area. 


6)  Since  weights  are  symmetrical,  half  the  calculations  can  be 
saved,  by  applying  computed  weights  to  both  participating 
patches. 

7)  Since  most  of  the  algorithm  is  local  in  nature,  it  lends  it¬ 
self  easily  to  parallelization.  As  4  and  8  processor  configu¬ 
rations  are  currently  widely  available,  this  can  be  used  for 
speeding  up  the  algorithm  by  about  one  order  of  magni¬ 
tude. 
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The  state  or  the  art  movie  restoration  methods  like  AWA,  I.MMSK 
either  estimate  motion  and  filter  out  the  trajectories,  or  compensate  the 
motion  by  an  optical  now  estimate  and  then  filter  out  the  compensated 
movie.  Now,  the  motion  estimation  problem  is  fundamentally  ill-posed. 
This  fact  is  known  as  the  aperture  problem:  trajectories  are  ambiguous 
since  they  could  coincide  with  any  promenade  in  Lite  space-time  isophote 
surface.  In  this  paper,  we  try  to  show  that,  for  denoising.  the  aperture 
problem  can  be  taken  advantage  of.  Indeed,  by  the  aperture  problem, 
many  pixels  in  the  neighboring  frames  are  similar  to  the  current  pixel 
one  wishes  to  denoise.  Thus,  denoising  by  an  averaging  process  can  use 
many  more  pixels  than  just  the  ones  on  a  single  trajectory.  This  obser¬ 
vation  leads  Lo  use  for  movies  a  recently  introduced  denoising  method, 
the  KL-means  algorithm.  This  static  3D  algorithm  outperforms  motion 
compensated  algorithms,  as  it  does  not  lose  movie  details.  It  involves  the 
whole  movie  isophote,  including  the  current  frame,  and  not  just  a  trajec¬ 
tory.  Experimental  evidence  will  be  given  that  it  also  improves  the  \dirt 
and  sparkle"  detection  algoritltms 


The  state  of  the  art  movie  restoration  methods  like  AWA,  LMMSE 
either  estimate  motion  and  filler  out  Use  trajectories,  or  compensate  tire 
motion  by  an  optical  flow  estimate  and  then  filter  out  Die  compensated 
movie.  Now,  the  motion  estimation  problem  Is  fundamentally  ill-posed. 
This  fact  is  known  as  the  aperture  problem :  trajectories  are  ambiguous 
since  they  could  coincide  with  any  promenade  In  the  space-time  isophote 
surface.  In  this  paper,  we  try  to  show  that,  for  denoising,  the  aperture 
problem  can  be  taken  advantage  of.  Indeed,  by  the  aperture  problem, 
many  pixels  in  lire  neighboring  frames  are  similar  to  the  current  pixel 
one  wishes  to  denoise.  Thus,  denoising  by  an  averaging  process  can  use 
many  more  pixels  than  Just  the  ones  on  a  single  trajectory.  Tills  obser¬ 
vation  leads  to  use  for  movies  a  recently  introduced  denoising  method, 
the  NL-means  algoriliun.  This  static  3D  algorithm  outperforms  motion 
compensated  algorithms,  as  it  does  noL  lose  movie  details.  It  involves  the 
whole  movie  isophote,  including  Use  current  frame,  and  not  just  a  trajec¬ 
tory.  Experimental  evidence  will  be  given  that  It  also  improves  the  idirt 
and  sparkle"  detection  algorithms 


Fig.  7.  Results  for  the  synthetic  text  sequence,  with  one  image  missing.  Left:  result  of  shift-and-add  algorithm  [18],  [20],  18.16  dB.  Right:  result  of  proposed 
algorithm,  17.7  dB. 


These  suggested  speedup  methods  can  reduce  the  complexity 
by  at  least  3-4  orders  of  magnitude  without  a  noticeable  drop  in 
the  quality  of  the  outputs.  This  makes  the  proposed  algorithm 
practical.  As  for  the  memory  requirement,  the  proposed  algo¬ 
rithm  uses  approximately  as  much  memory  as  required  to  hold 
the  entire  processed  sequence  in  the  high  resolution  scale.  This 
is  usually  not  a  limitation. 

V.  Experimental  Results 

A.  Experiments 

In  this  section,  we  validate  the  potential  of  the  proposed  algo¬ 
rithm  (we  use  the  simpler  version  discussed  in  Section  IV-B)  by 
presenting  the  obtained  results  of  processing  several  image  se¬ 
quences.  We  start  with  one  synthetic  (text)  sequence  with  global 
motion  that  comes  to  demonstrate  the  conceptual  super-resolu¬ 
tion  capabilities  of  the  proposed  algorithms.  Then  we  turn  to 
three  real  sequences  with  a  general  motion  pattern.  As  there 
are  no  published  methods  that  perform  super  resolution  on  such 
general  sequences,  the  comparison  we  provide  is  to  a  single 
image  up-sampling  using  the  Lanczos  algorithm  [45],  [46]  that 
effectively  approximates  the  Sine  interpolation. 

The  Lanczos  interpolation  also  serves  as  a  benchmark  for  the 
synthetic  case,  even  though  super  resolution  algorithms  that  rely 
on  global  motion  estimation  can  process  this  specific  sequence 
quite  successfully.  This  is  because  the  benchmark  for  the  pro¬ 
posed  algorithm  should  also  be  able  to  handle  all  sequences,  and 
not  limited  to  the  global  motion  case. 

All  the  tests  in  this  section  are  prepared  in  the  following 
manner:  An  input  sequence  is  blurred  using  a  3  X  3  uniform 
mask,  decimated  by  a  factor  of  1 :3  (in  each  axis),  and  then  con¬ 
taminated  by  an  additive  noise  with  std  =  2.  All  images  are  in 
the  input  range  [0,255]. 

The  first  test  is  a  very  simple  synthetic  test,  that  motion-es- 
timated-based  super-resolution  algorithms  are  expected  to  re¬ 
solve  well,  intended  to  show  that  the  proposed  algorithm  in¬ 
deed  achieves  super  resolution.  A  text  image  is  used  to  generate 
a  9-image  input  sequence,  by  applying  integer  displacements 
prior  to  blurring,  decimation  and  the  addition  of  noise.  The  dis¬ 
placements  are  chosen  so  that  the  entire  decimation  space  is  cov¬ 
ered  (i.e.,  dx  =  {0, 1, 2}  and  dy  =  {0, 1, 2}).  The  result  for 


this  test  is  shown  in  Fig.  6,  including  a  comparison  to  the  results 
obtained  by  the  regularized  shift-and-add  algorithm  [18],  [20], 
which  is  a  conventional  motion-estimation-based  super-resolu¬ 
tion  algorithm  .  The  block  size  used  for  computing  the  weights 
(R)  was  set  to  3 1  x  31,  since  the  motion  in  the  sequence  is  lim¬ 
ited  to  displacements,  and  a  larger  block  allows  capturing  the 
true  displacement  better  (for  true  sequences,  this  size  will  be 
greatly  reduced,  as  explained  later).  The  value  of  a  that  moder¬ 
ates  the  weights  was  set  to  7.5  (due  to  the  large  differences  be¬ 
tween  white  and  black  values  in  the  scene).  Two  iterations  were 
ran  on  the  entire  sequence,  the  first  iteration  used  for  computing 
the  weights  for  the  second  iteration. 

We  also  ran  a  similar  test,  displaying  the  behavior  of  the  pro¬ 
posed  algorithm  and  shift-and-add  approach  when  one  of  the 
images  from  the  set  is  omitted.  The  results  for  this  test  are  dis¬ 
played  in  Fig.  7. 

Even  though  the  proposed  algorithm  does  not  exploit  the  fact 
the  motion  in  the  sequence  is  only  global  translation,  it  still 
achieves  good  results.  The  text  is  almost  completely  readable  in 
the  result  of  the  proposed  algorithm.  This  shows  that  sub-pixel 
details  (relative  to  the  low  resolution  grid)  are  indeed  recov¬ 
ered  by  the  proposed  algorithm.  In  terms  of  Peak  Signal  to 
Noise  Ratio  (PSNR),5  a  3:1  pixel-replication  in  each  axis  leads 
to  13.47  dB,  the  Lanczos  interpolation  gives  13.84  dB,  a  de- 
blurred  Lanczos  interpolation  gives  13.9  dB,  the  regularized 
shift-and-add  gives  18.4  dB,  and  the  proposed  algorithm  gives 
18.48  dB,  slightly  out-performing  the  classic  approach.  For  the 
test  with  one  image  omitted,  the  shift-and-add  gives  18.16  dB, 
while  the  proposed  algorithm  slightly  under-performs  with  a 
result  of  17.7  dB.  The  similarity  in  performance  between  the 
shift-and-add  approach  and  the  proposed  approach  attests  to  the 
power  of  the  proposed  method,  as  even  though  it  does  not  rely 
directly  on  the  global  motion  assumption,  it  achieves  similar  re¬ 
sults  to  a  successful  method  that  does. 

It  is  obvious  that  while  the  PSNR  measures  of  the 
shift-and-add  approach  and  the  proposed  algorithm  are  similar, 
the  results  of  the  shift-and-add  are  visually  more  pleasing.  This 
is  to  be  expected,  since  this  approach  is  able  to  fully  benefit 

5Defined  as  20  log10(255/ \/ MSE ) ,  where  MSE  is  the  mean-squared-error 
obtained  per  pixel. 
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Fig.  8.  Results  for  the  3rd,  8th,  13th,  18th,  23rd,  and  the  28th  frames  from  the 
“Miss  America”  sequence.  From  left  to  right:  Low  resolution  image;  original 
image  (ground  truth);  Lanczos  interpolation;  result  of  the  proposed  algorithm. 


from  the  global  motion  assumption,  while  the  proposed  algo¬ 
rithm  does  not.  This  is  due  to  the  proposed  algorithm  trading 
off  quality  for  increased  robustness  to  general  motion  patterns 
(which  the  shift- and-add  approach  can’t  cope  with). 

When  observing  the  resulting  images  of  the  proposed  algo¬ 
rithm,  they  seem  somewhat  over- sharpened.  In  fact,  applying 
less  deblurring  iterations  results  in  a  more  visually  pleasing  re¬ 
sult,  however,  the  objective  PSNR  measure  yields  a  lower  value 


* 


*4 


m 


**! 


ygii 


Fig.  9.  Results  for  the  3rd,  8th,  13th,  18th,  23rd,  and  the  28th  frames  from  the 
“Foreman”  sequence.  From  left  to  right:  Low  resolution  image;  original  image 
(ground  truth);  Lanczos  interpolation;  result  of  the  proposed  algorithm. 


for  these  images.  It  seems  like  a  stronger  deblurring  mechanism, 
applied  to  the  results  of  the  fusion  stage,  can  generate  results  that 
are  both  visually  pleasing  and  relatively  accurate  reconstruc¬ 
tions.  Since  the  novelty  of  this  paper  lies  in  the  fusion  stage, 
we  have  not  experimented  with  other  such  mechanisms. 

We  now  turn  to  test  the  proposed  algorithm  on  real  se¬ 
quences,  containing  general  motion  patterns,  which  represent 
the  actual  problem  the  proposed  algorithm  is  designed  to  tackle. 
The  3  sequences  we  test  on  are  usually  used  for  testing  com¬ 
pression  algorithms,  referred  to  as  “Miss  America,”  “Foreman, 
”  and  “Suzie.”  The  super-resolution  results  for  these  sequences 
(scaling  up  by  a  factor  of  3  to  1  in  each  axis)  are  shown  in 
Figs.  8-10,  respectively.  All  of  these  sequences  (original, 
low-quality,  Lanczos  interpolation  and  processed  sequences) 
appear  in  the  first  author’s  website,  at  http:Wwww.cs.tech- 
nion .  ac  .il\~matanpr\NLM-  SR. 

Each  result  shows  several  (3rd,  8th,  13th,  18th,  23rd,  and 
the  28th)  frames  from  each  sequence  (each  row  represents  one 
frame).  For  each  frame,  the  input  low  resolution  image,  the 
ground  truth  image,  the  result  of  the  Lanczos  interpolation,  and 
the  result  of  the  proposed  algorithm  are  presented  from  left  to 
right.  It  is  very  evident  that  the  results  of  the  proposed  algo¬ 
rithm  are  dramatically  better  when  compared  to  the  Lanczos  in¬ 
terpolation — the  output  is  sharper,  contains  more  details,  and  is 
more  visually  pleasing.  It  is  important  to  note,  that  we  display 
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TABLE  III 

Mean-PSNR  Results  for  the  Three  Test  Sequences.  This  Tables  Gives  the  PSNR  for  the  Pixel-Replicated  Sequences, 
the  Lanczos  Interpolation,  the  Lanczos  Interpolation  With  Consequent  Deblurring,  and  the  Results 
of  the  Proposed  Algorithm  (First  and  Second  Iteration) 


Sequence 

Pixel 

Replication 

Lanczos 

Interpolation 

Deblurred  Lancozs 
Interpolation 

Our  Result 
(1st  Iteration) 

Our  Result 
(2nd  iteration) 

Miss- America 

31.43 

34.08 

34.26 

35.91 

34.97 

Foreman 

28.99 

31.25 

31.13 

34.27 

33.13 

Suzie 

30.02 

31.4 

31.59 

33.74 

33.32 

Fig.  10.  Results  for  the  3rd,  8th,  13th,  18th,  23rd,  and  the  28th  frames  from 
the  “Suzie”  sequence.  From  left  to  right:  Low  resolution  image;  original  image 
(ground  truth);  Lanczos  interpolation;  result  of  the  proposed  algorithm. 


the  results  of  the  Lanczos  interpolation  without  any  postdeblur- 
ring.  Applying  deblurring  to  the  Lanczos  interpolation  results 
in  a  slightly  sharper  image,  but  one  that  also  contains  unwanted 
artifacts  [as  can  be  seen  in  Fig.  6(c)  and  (d)]  and  the  PSNR  mea¬ 
sures  are  also  equivalent. 

In  processing  all  of  these  sequences,  all  30  frames  partici¬ 
pated  in  the  reconstruction  of  each  image.  The  similarity  block 
size  used  for  computing  the  weights  (R)  was  13  x  13,  and  did 
not  vary  between  the  different  tests.  The  search  area  (i.e.,  the 
size  of  the  neighborhood  N )  was  determined  manually,  in  order 
to  reduce  computation  time:6  a  search  area  of  13  X  13  pixels  in 
each  image  for  the  “Miss  America”  sequence,  21  x  21  for  the 
“Foreman”  sequence,  and  31  x  31  for  the  “Suzie”  sequence. 
The  parameter  a  was  set  to  2.2  for  all  sequences.  Two  iterations 
were  again  used,  where  the  first  provides  the  updated  weights 
for  the  second  iteration.  Just  to  put  things  into  perspective,  we 
add  that  the  entire  simulation  is  done  on  Matlab,  running  on  a 
regular  Pentium  3  GHz  (2-GB  RAM)  machine,  requiring  ap¬ 
proximately  20  s  per  frame  for  the  most  demanding  case — the 

6Applying  a  bigger  search  area  results  in  an  only  slightly  different  super¬ 
resolution  outcome. 


“Suzie”  sequence  with  high-resolution  frame  size  of  210  X  250 
pixels. 

Table  III  presents  the  mean  PSNR  for  each  of  the  three  test  se¬ 
quences,  evaluating  the  quality  of  the  pixel-replicated  scaled-up 
sequence,  the  Lanczos  interpolation,  and  the  results  we  obtain 
after  the  first  and  the  second  iterations  of  the  proposed  algo¬ 
rithm.  While  the  super-resolution  results  show  higher  PSNR,  we 
see  that  i)  the  PSNR  gain  is  mild,  and  does  not  reflect  the  visual 
quality  of  the  sequences  obtained;  and  ii)  Even  though  the  first 
iteration  result  is  typically  inferior  in  visual  quality,  its  PSNR 
assessment  tends  to  be  higher. 

Beyond  the  usual  and  common  complaint  on  the  inability  of 
PSNR  to  grasp  image  quality  in  general,  it  seems  that  our  algo¬ 
rithm  in  particular  is  very  sensitive  to  this  measure.  One  possible 
reason  for  this  could  be  the  fact  that  in  avoiding  explicit  motion 
estimation,  the  result  we  obtain  is  slightly  shifted  (or  warped) 
with  respect  to  the  reference  it  aims  to  recover.  This,  of-course, 
leads  to  a  disastrous  drop  in  PSNR.  Another  reason  that  may  ac¬ 
count  for  such  behavior  is  the  over- sharpening  that  the  deblur¬ 
ring  stage  introduces.  We  intend  to  further  explore  this  matter  in 
our  future  work,  as  we  hope  to  provide  visually  pleasing  results 
that  are  also  backed  up  by  good  PSNR  behavior. 

B.  Discussion 

The  results  obtained  are  very  encouraging,  being  artifact-free 
and  of  high  quality.  Still,  we  believe  that  these  results  could  be 
further  improved  and  extended  in  various  ways.  The  manual  se¬ 
lection  of  the  various  parameters  could  be  replaced  by  an  adap¬ 
tive  setting  of  their  values.  One  specific  parameter  of  importance 
is  the  filtering  parameter  a,  effecting  the  weights  computation. 
Selecting  this  parameter  in  a  locally  adaptive  way  seems  like 
a  natural  step  to  take,  allowing  the  algorithm  to  adapt  better  to 
both  small  and  large  details. 

Similarly,  an  adaptive  selection  of  the  window  size  (both  for 
the  weights  computation  and  in  the  algorithm  itself),  as  done  in 
[28]  and  [29],  can  improve  the  algorithm’s  performance.  A  large 
block  can  help  estimate  the  motion  more  accurately.  A  small 
block  can  better  adapt  to  general  and  varying  motion.  Adap¬ 
tively  selecting  the  window  size,  as  in  [28],  is  expected  to  de¬ 
liver  much  better  results  than  just  settling  for  a  global  tradeoff. 

A  similar,  yet  somewhat  different  concept,  is  a  better  con¬ 
trol  over  the  search  area.  While  the  estimated  motion  pattern 
is  fuzzy,  reducing  the  number  of  candidate  locations  can  help 
reduce  complexity  and  improve  the  results.  Relying  on  some 
coarse  optical  flow  computation  as  a  preceding  step  for  the  fuzzy 
motion  estimation  can  bring  these  improvements.  An  alternative 
is  computing  a  coarse  fuzzy  motion  pattern  in  a  coarser  scale, 


Authorized  licensed  use  limited  to:  IEEE  Xplore.  Downloaded  on  December  18,  2008  at  18:02  from  IEEE  Xplore.  Restrictions  apply. 


PROTTER  etal.:  GENERALIZING  THE  NONLOCAL-MEANS  TO  SUPER-RESOLUTION  RECONSTRUCTION 


49 


and  proceeding  to  search  in  the  actual  scale  only  around  the 
likely  coarse  destination. 

A  somewhat  different  direction  for  further  research  is 
adopting  the  fuzzy  motion  concept  in  other  approaches  to  the 
super  resolution  problem.  In  the  video  denoising  field,  the  very 
successful  NLM  was  followed  by  even  more  successful  con¬ 
tributions  using  fuzzy  motion,  such  as  the  SW3D  [29]  and  the 
method  relying  on  sparse  and  redundant  representations  [30]. 
We  believe  that  these  approaches,  and  others,  can  also  serve  as 
the  basis  for  more  successful  super  resolution  algorithms  that 
do  not  rely  on  explicit  motion  estimation. 

VI.  Summary 

This  paper  introduces  a  novel  and  successful  super  resolution 
algorithm,  which  does  not  rely  on  explicit  motion  estimation.  In¬ 
stead,  a  local  and  patch-based  approach  is  combined  with  fuzzy 
motion  estimation.  This  allows  the  algorithm  to  handle  diverse 
motion  fields,  instead  of  the  common  global  motion  limitation 
that  characterizes  traditional  super-resolution  methods. 

The  algorithm  developed  here  is  an  extension  of  the  very  suc¬ 
cessful  video  denoising  Nonlocal-Means  algorithm,  reported  in 
[31].  Its  ability  to  denoise  image  sequences  without  explicit  mo¬ 
tion  estimation,  which  was  previously  considered  necessary  in 
this  field,  led  us  to  start  our  search  for  an  explicit-motion-esti- 
mation-free  super  resolution  algorithm  there. 

By  analyzing  the  forces  existing  in  the  NLM  algorithm,  we 
were  able  to  write  an  energy  function,  whose  minimization  leads 
to  a  powerful  denoising  algorithm,  of  which  the  NLM  is  a  spe¬ 
cial  case.  Then,  we  extended  this  energy  function  to  the  super 
resolution  problem,  by  making  the  necessary  changes.  Starting 
from  this  energy  minimization,  we  developed  a  simple  yet  very 
effective  super  resolution  algorithm. 

Finally,  we  have  processed  some  real  sequences  that  contain 
complex  motion  patterns  with  the  proposed  algorithm.  The  ob¬ 
tained  results  are  artifact  free  and  of  high  quality,  thus  proving 
the  ability  of  the  proposed  method  to  handle  general  sequences. 
Lastly,  we  have  made  several  suggestions  as  to  directions  for 
further  research. 


Proof:  We  start  with  (Al).  Both  sides  of  this  equation 
describe  processes  where  image  patches  are  piled  with  proper 
weights.  Consider  an  arbitrary  location  (m,n)  in  the  resulting 
image,  and  lets  see  which  patches  are  accumulated  at  this  point 
as  a  center  one.  Recall  that  the  operator  R assigns  a  patch  to 
location  (&,/). 

Looking  at  the  Left-Hand-Side  (LHS),  the  only  patch  to  be 
positioned  in  location  (m,n)  (i.e.,  to  be  multiplied  by  R^n) 
is  nx,  but  this  is  done  several  times,  accumulating  a  total 
weight  of  T,(i,j)eN(m,n)  n ,  i,j]. 

The  Right-Hand-Side  (RHS)  performs  a  slightly  more  in¬ 
volved  accumulation.  Among  all  (&,  Z)  pixels  in  the  image,  only 
those  that  are  neighbors  to  (m,  n)  are  relevant,  and  each  of 
those  contributes  one  patch  of  the  form  R^  nx  with  a  weight 
w[k ,  Z,  m,  n] .  Thus,  again,  we  obtain  that  only  one  patch  is  used, 
but  accumulated  several  times.  The  total  weight  of  this  accumu¬ 
lation  is  given  by  J](k,z)eN(m,n)  w[k,  l,  n].  Due  to  the  sym¬ 
metry  of  the  weights,  we  have 


Y  w[k,  Z,  m,  n\  =  Y^  w[m,  n,  k ,  Z]  (A3) 

( k,l)£N(m,n )  (fc,Z)EAT(m,n) 


which  is  the  same  as  the  LHS  accumulated  weight. 

Turning  to  the  equality  in  (A2),  we  adopt  a  similar  analysis. 
Starting  this  time  with  the  RHS,  the  overall  patches  to  be  accu¬ 
mulated  in  location  (m,  n)  (i.e.,  those  multiplied  by  R^?n)  are 
given  by 


E 


(i,j)EN(m,n) 


In  the  LHS,  only  pixels  (k,l)  that  have  (m,  n)  in  their  neigh¬ 
borhood  are  relevant  in  the  outer  summation.  For  those,  only 
one  neighbor,  the  (m,  n)  one,  is  relevant,  as  this  is  the  only  one 
multiplied  by  R^  n,  and,  thus,  positioned  in  the  location  we 
consider.  Thus,  the  accumulation  in  this  case  becomes 

w[A;,Z,ra,n]Rfc?zx  (A4) 

( k,l)€N(m,n ) 


Appendix  A 
Proofs 

Theorem  1:  Assuming  that 

1)  (hj)  C  N(k,t)  implies  (k,t)  G  N(i,j),  and 

2)  w[k,l,i,j]  =  w[i,j,k,l\, 

the  following  two  equalities  hold  true: 

Y  Y  w[k,fi,j]RljRkjx 

(k,i)en(i,j)eN(k,i) 

=  Y  w[k,l,i,j]RljRijx  (Al) 

( i,j)€N(k,l ) 

and 

=  Y  (A2) 

(fc,Z)GO  (i,j)eN(k,l) 


and  due  to  the  symmetry  of  the  weights,  this  is  the  same  as  the 
RHS  term.  □ 

Theorem  2:  The  matrix  A  =  ?!;[&,  Z]Rj^Rfc,z 

is  diagonal  and  positive  definite. 

Proof:  Consider  first  the  term  R^  jRfc,z  for  an  arbitrary 
(&,  Z)  position.  This  matrix  is  positive  semi-definite  by  defini¬ 
tion,  due  to  the  multiplication  of  a  matrix  by  its  adjoint.  Fur¬ 
thermore,  as  an  operator  that  manipulates  an  image,  the  multi¬ 
plication  by  Rfc5/  extracts  a  patch  from  location  (k,l).  The  later 
multiplication  by  R ^  creates  a  zero-filled  image,  with  the  same 
patch  returned  to  the  (&,  Z)  location.  Thus,  this  matrix  is  a  diag¬ 
onal  matrix,  with  ones  on  the  main  diagonal  for  pixels  belonging 
to  the  patch,  and  zeros  elsewhere. 

We  assume  that  the  weights  w[k,l,i,j\  and,  thus,  also 
w[k,l\,  are  non-negative  by  their  definition,  and,  thus,  the 
matrix  Yl(k  i)en  Z]RjfjRfctj  is  a  non-negative  weighted 
average  of  positive  semi-definite  and  diagonal  matrices.  Thus, 
the  result  is  also  diagonal  and  semi-definite.  The  addition  of 
A  •  I  preserves  the  diagonal  structure,  and  lifts  the  smallest 


Authorized  licensed  use  limited  to:  IEEE  Xplore.  Downloaded  on  December  18,  2008  at  18:02  from  IEEE  Xplore.  Restrictions  apply. 


eigenvalue  of  the  overall  matrix  to  A  >  0,  thus  leading  to  a 
strictly  positive-definite  matrix,  as  claimed.  □ 
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